This project explores Airbnb listings throughout New York City with ggplot2 in R. We also use a Chi-Square test and ANOVA to examine the relationships between different variables. The project uses R and RStudio to draw insights from the dataset, which is sourced from Kaggle and contains Airbnb listings in New York City. The variables in this dataset include neighborhoods, room type, location (longitude and latitude), reviews, prices, and availability.
This tutorial will take the reader through a step-by-step process to explore data in R using ggplot2.
At the end of the tutorial, readers will be able to create plots in R using ggplot2, fit an ANOVA model, and conduct a Chi-Square test. The objective of this project is to uncover insights into the NYC Airbnb market and look for trends in the data.
Welcome to the ggplot tutorial in R. This lesson demonstrates how to use the ggplot2 library in R to create data visualizations and run analyses. We will use the Airbnb data set, which contains Airbnb listings in New York City. We will uncover trends in the data and learn more about this exciting data set. Come along as we explore all the Big Apple has to offer.
Objectives
There are a few main objectives of this tutorial including the following:
Key Takeaways
Data Source and Variables
What is ggplot?
ggplot is a system for creating graphics
Basics
+ (%+%) = This is used to add components to a plot
Geoms
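To make the role of `+` concrete, here is a minimal sketch that builds a plot one component at a time. The data frame `toy` and its values are made up for illustration and are not part of the Airbnb dataset.

```r
library(ggplot2)

# Toy data standing in for the Airbnb listings (values are made up)
toy <- data.frame(
  longitude = c(-73.97, -73.98, -73.94, -73.96),
  latitude  = c(40.65, 40.75, 40.81, 40.69),
  room      = c("Private room", "Entire home/apt", "Private room", "Entire home/apt")
)

# Start with a base plot that knows the data and the aesthetic mappings
base <- ggplot(toy, aes(x = longitude, y = latitude))

# Each `+` adds one component: first a geom, then the labels
p <- base +
  geom_point(aes(color = room)) +
  labs(title = "Toy listings", x = "Longitude", y = "Latitude")

length(base$layers)  # the base plot has no geoms yet
length(p$layers)     # the finished plot has one point layer
```

Because `base` is stored separately, the same starting point can be reused with a different geom without retyping the data and mappings.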
# load the tidyverse package; it attaches ggplot2 and dplyr, which are used throughout the tutorial
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
# import data and save as airbnb_data
airbnb_data <- read.csv("airbnb.csv")
# look at the data, it will show how many rows and columns there are
glimpse(airbnb_data)
## Rows: 48,895
## Columns: 16
## $ id <int> 2539, 2595, 3647, 3831, 5022, 5099, 512…
## $ name <chr> "Clean & quiet apt home by the park", "…
## $ host_id <int> 2787, 2845, 4632, 4869, 7192, 7322, 735…
## $ host_name <chr> "John", "Jennifer", "Elisabeth", "LisaR…
## $ neighbourhood_group <chr> "Brooklyn", "Manhattan", "Manhattan", "…
## $ neighbourhood <chr> "Kensington", "Midtown", "Harlem", "Cli…
## $ latitude <dbl> 40.64749, 40.75362, 40.80902, 40.68514,…
## $ longitude <dbl> -73.97237, -73.98377, -73.94190, -73.95…
## $ room_type <chr> "Private room", "Entire home/apt", "Pri…
## $ price <int> 149, 225, 150, 89, 80, 200, 60, 79, 79,…
## $ minimum_nights <int> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1, 5, 2, 4…
## $ number_of_reviews <int> 9, 45, 0, 270, 9, 74, 49, 430, 118, 160…
## $ last_review <chr> "2018-10-19", "2019-05-21", "", "2019-0…
## $ reviews_per_month <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0.40,…
## $ calculated_host_listings_count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 3, …
## $ availability_365 <int> 365, 355, 365, 194, 0, 129, 0, 220, 0, …
# change the names of the variables
# (names are assigned by position, so the order must match the 16 columns shown above)
colnames(airbnb_data) <- c(
  "id", "Airbnb_name", "host_id", "host_name", "borough", "neighborhood",
  "latitude", "longitude", "property_type", "price_per_night",
  "minimum_nights", "number_of_reviews", "last_review_date",
  "reviews_per_month", "total_host_listings", "availability_365"
)
# Use the dplyr library to clean the data
# The following filters remove rows that are missing review data.
# read.csv() loads a missing reviews_per_month as NA, but a missing
# last_review_date as an empty string "", so is.na() alone does not catch blank dates.
airbnb_data <- airbnb_data %>%
  filter(!is.na(reviews_per_month)) %>%
  filter(!is.na(last_review_date), last_review_date != "")
#check the data again
glimpse(airbnb_data)
## Rows: 38,843
## Columns: 16
## $ id <int> 2539, 2595, 3831, 5022, 5099, 5121, 5178, 5203, 52…
## $ Airbnb_name <chr> "Clean & quiet apt home by the park", "Skylit Midt…
## $ host_id <int> 2787, 2845, 4869, 7192, 7322, 7356, 8967, 7490, 75…
## $ host_name <chr> "John", "Jennifer", "LisaRoxanne", "Laura", "Chris…
## $ borough <chr> "Brooklyn", "Manhattan", "Brooklyn", "Manhattan", …
## $ neighborhood <chr> "Kensington", "Midtown", "Clinton Hill", "East Har…
## $ latitude <dbl> 40.64749, 40.75362, 40.68514, 40.79851, 40.74767, …
## $ longitude <dbl> -73.97237, -73.98377, -73.95976, -73.94399, -73.97…
## $ property_type <chr> "Private room", "Entire home/apt", "Entire home/ap…
## $ price_per_night <int> 149, 225, 89, 80, 200, 60, 79, 79, 150, 135, 85, 8…
## $ minimum_nights <int> 1, 1, 1, 10, 3, 45, 2, 2, 1, 5, 2, 4, 2, 90, 2, 2,…
## $ number_of_reviews <int> 9, 45, 270, 9, 74, 49, 430, 118, 160, 53, 188, 167…
## $ last_review_date <chr> "2018-10-19", "2019-05-21", "2019-07-05", "2018-11…
## $ reviews_per_month <dbl> 0.21, 0.38, 4.64, 0.10, 0.59, 0.40, 3.47, 0.99, 1.…
## $ total_host_listings <int> 6, 2, 1, 1, 1, 1, 1, 1, 4, 1, 1, 3, 1, 1, 1, 1, 1,…
## $ availability_365 <int> 365, 355, 194, 0, 129, 0, 220, 0, 188, 6, 39, 314,…
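Before dropping rows, it can help to count how many values are actually missing in each column. The sketch below uses a made-up data frame `toy` that mimics how `read.csv()` loads the two review columns: numeric gaps arrive as `NA`, while blank text fields arrive as empty strings.

```r
# Toy data mimicking the two review columns after read.csv() (values are made up)
toy <- data.frame(
  last_review_date  = c("2018-10-19", "", "2019-07-05", ""),
  reviews_per_month = c(0.21, NA, 4.64, NA)
)

# Numeric gaps are NA, so is.na() finds them
n_na_reviews <- sum(is.na(toy$reviews_per_month))

# Blank character fields are empty strings, which is.na() does not flag
n_na_dates    <- sum(is.na(toy$last_review_date))
n_blank_dates <- sum(toy$last_review_date == "")

# Keep only rows where both fields are present
clean <- toy[!is.na(toy$reviews_per_month) & toy$last_review_date != "", ]

c(na_reviews = n_na_reviews, na_dates = n_na_dates,
  blank_dates = n_blank_dates, rows_kept = nrow(clean))
```

Comparing the `NA` count with the blank-string count shows whether the two columns are missing for the same rows.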
Others: geom_smooth() or stat_smooth() = smoothed conditional means
Here is how you can customize a plot
Themes
Colors
Titles
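As a rough illustration of the three customization areas listed above, the sketch below applies a theme, a manual color scale, and titles to one scatter plot. The data frame `toy` and the chosen colors are made up for illustration.

```r
library(ggplot2)

# Toy data using the tutorial's column names (values are made up)
toy <- data.frame(
  minimum_nights   = c(1, 3, 10, 2, 5, 30),
  availability_365 = c(365, 194, 0, 129, 220, 90),
  property_type    = rep(c("Entire home/apt", "Private room", "Shared room"), 2)
)

p <- ggplot(toy, aes(x = minimum_nights, y = availability_365, color = property_type)) +
  geom_point(size = 3) +
  # Colors: assign a specific color to each property type
  scale_color_manual(values = c(
    "Entire home/apt" = "steelblue",
    "Private room"    = "darkorange",
    "Shared room"     = "purple"
  )) +
  # Titles: plot title, axis labels, and legend title in one call
  labs(title = "Minimum Nights by Availability",
       x = "Min Nights", y = "Availability", color = "Property Type") +
  # Themes: start from a built-in theme, then adjust one element
  theme_minimal() +
  theme(legend.position = "bottom")

p
```

The `theme()` call comes after `theme_minimal()` because a complete theme replaces any theme settings added before it.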
# load the ggplot2 library (already attached by tidyverse, so this is optional)
library(ggplot2)
# start with ggplot() for each plot
# Scatter Plots
# 1. Create a scatter plot using longitude as x and latitude as y. Use property type as the color and add a title and labels
ggplot(data = airbnb_data) +
  geom_point(mapping = aes(x = longitude, y = latitude, color = property_type)) +
  labs(title = "Airbnb Locations by Property Type", x = "Longitude", y = "Latitude")
# 2. Create a scatter plot using minimum nights as x and the number of days available throughout the year as y. Use property type as the color and add a title and labels
ggplot(data = airbnb_data) +
  geom_point(mapping = aes(x = minimum_nights, y = availability_365, color = property_type)) +
  labs(title = "Minimum Nights by Availability", x = "Min Nights", y = "Availability")
# 3. Create a scatter plot of the location using longitude and latitude. Make the points purple and add a title and labels, this time using ggtitle and xlab/ylab. Use geom_smooth with a linear model to add a trend line
ggplot(data = airbnb_data) +
  geom_point(mapping = aes(x = longitude, y = latitude), color = "purple") +
  xlab("Longitude") +
  ylab("Latitude") +
  ggtitle("Scatter Plot of Location") +
  geom_smooth(method = "lm", mapping = aes(x = longitude, y = latitude))
## `geom_smooth()` using formula = 'y ~ x'
# Box plots
# 1. Create a box plot of prices by borough and fill with borough
ggplot(data = airbnb_data) +
  geom_boxplot(mapping = aes(x = borough, y = price_per_night, fill = borough)) +
  labs(title = "Prices by Borough", x = "Borough", y = "Price")
# 2. Create a box plot of reviews per month by borough and fill with borough
ggplot(data = airbnb_data) +
  geom_boxplot(mapping = aes(x = borough, y = reviews_per_month, fill = borough)) +
  labs(title = "Reviews by Borough", x = "Borough", y = "Reviews per Month")
# Bar plots
# 1. Create a bar plot of the boroughs
ggplot(data = airbnb_data) +
  geom_bar(mapping = aes(x = borough)) +
  labs(title = "Borough Bar Plot", x = "Borough")
# 2. Create a bar plot of the property type
ggplot(data = airbnb_data) +
  geom_bar(mapping = aes(x = property_type)) +
  labs(title = "Property Type", x = "Property Type")
# Histograms
# 1. Create a histogram of the price distribution. Make the bars red. Set the bin width to 40.
ggplot(airbnb_data, aes(x = price_per_night)) +
  geom_histogram(binwidth = 40, fill = "red") +
  labs(title = "Distribution of Prices", x = "Price", y = "Frequency")
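Prices in this dataset have a long right tail (the summary statistics in this tutorial report a maximum of 10,000), which can squash a histogram against the left edge. One possible remedy, sketched below on made-up prices, is to zoom the x-axis with `coord_cartesian()`. The upper limit of 1000 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Toy prices with one extreme value (made up)
toy <- data.frame(price_per_night = c(60, 79, 80, 89, 149, 150, 200, 225, 10000))

p <- ggplot(toy, aes(x = price_per_night)) +
  geom_histogram(binwidth = 40, fill = "red") +
  # Zoom in on the x-axis; every row is still counted, the view is just cropped
  coord_cartesian(xlim = c(0, 1000)) +
  labs(title = "Distribution of Prices (zoomed)", x = "Price", y = "Frequency")

p
```

Unlike filtering the data, `coord_cartesian()` only changes the visible range, so the extreme listing remains in the data used for binning.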
# To save a plot to a file, use ggsave()
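A minimal sketch of `ggsave()` follows. The data frame `toy`, the file name, and the size settings are assumptions chosen for illustration; the file is written to a temporary directory so nothing lands in the project folder.

```r
library(ggplot2)

# Toy data (made up) and a simple bar plot to save
toy <- data.frame(borough = c("Brooklyn", "Manhattan", "Manhattan", "Queens"))
p <- ggplot(toy, aes(x = borough)) +
  geom_bar() +
  labs(title = "Borough Bar Plot", x = "Borough")

# Hypothetical output path inside the session's temporary directory
out_file <- file.path(tempdir(), "borough_bar_plot.png")

# The file extension selects the output format; width and height are in inches by default
ggsave(filename = out_file, plot = p, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

Passing `plot = p` explicitly avoids relying on the default, which saves the last plot that was displayed.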
Faceting combines multiple plots into one figure, with one panel per subset of the data.
facet_grid() = forms a matrix of panels using rows and columns based on the variables. This is used mainly when there are two discrete variables
facet_wrap() = use this function when there is only one variable with multiple levels
# Let's use facet_wrap to show box plots of the number of reviews, one panel per borough.
ggplot(data = airbnb_data) +
  geom_boxplot(mapping = aes(y = number_of_reviews)) +
  facet_wrap(~borough)
# facet box plots of price per night by property type, using fill to specify the color of the plots
ggplot(data = airbnb_data) +
  geom_boxplot(mapping = aes(x = price_per_night), fill = "orange") +
  facet_wrap(~property_type)
# facet histograms of price per night by borough, using fill to specify the color of the plots
ggplot(data = airbnb_data) +
  geom_histogram(mapping = aes(x = price_per_night), fill = "purple") +
  facet_wrap(~borough)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
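The facet examples above all use `facet_wrap()`. For comparison, here is a small sketch of `facet_grid()` with two discrete variables, one for the rows and one for the columns. The data frame `toy` is made up for illustration.

```r
library(ggplot2)

# Toy data with two discrete variables and one numeric variable (values are made up)
toy <- data.frame(
  borough         = rep(c("Brooklyn", "Manhattan"), each = 4),
  property_type   = rep(c("Entire home/apt", "Private room"), times = 4),
  price_per_night = c(120, 60, 150, 75, 225, 110, 300, 95)
)

p <- ggplot(toy, aes(x = price_per_night)) +
  geom_histogram(binwidth = 50, fill = "purple") +
  # rows ~ columns: one row of panels per property type, one column per borough
  facet_grid(property_type ~ borough)

p
```

With two property types and two boroughs, this produces a 2 x 2 matrix of panels that share the same axes, which makes the panels directly comparable.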
# Descriptive Statistics
#Let's do some descriptive statistics for some of our variables
summary(airbnb_data$price_per_night)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 69.0 101.0 142.3 170.0 10000.0
summary(airbnb_data$minimum_nights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 5.868 4.000 1250.000
summary(airbnb_data$number_of_reviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 3.0 9.0 29.3 33.0 629.0
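The summaries above describe each variable across all listings. A per-group version can be sketched with dplyr's `group_by()` and `summarise()`, shown here on a made-up data frame `toy` rather than the real listings.

```r
library(dplyr)

# Toy data using the tutorial's column names (values are made up)
toy <- data.frame(
  borough         = c("Brooklyn", "Brooklyn", "Manhattan", "Manhattan", "Queens"),
  price_per_night = c(89, 149, 225, 80, 60)
)

by_borough <- toy %>%
  group_by(borough) %>%
  summarise(
    listings     = n(),                      # number of rows in each borough
    median_price = median(price_per_night),
    mean_price   = mean(price_per_night),
    .groups = "drop"
  ) %>%
  arrange(desc(median_price))                # most expensive borough first

by_borough
```

Reporting the median alongside the mean is useful for skewed variables such as price, where a few very expensive listings pull the mean upward.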
# Let's check if there is a relationship between property type and borough.
# Chi-Square Test (α = 0.05): We will use a Chi-Square test to look at the relationship between property type and borough.
cross_tab <- table(airbnb_data$property_type, airbnb_data$borough)
result <- chisq.test(cross_tab)
print(result)
##
## Pearson's Chi-squared test
##
## data: cross_tab
## X-squared = 962.33, df = 8, p-value < 2.2e-16
# Since the p-value is less than alpha, there is a statistically significant association between property type and borough.
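A significant Chi-Square result says that an association exists, but not which cells drive it. One way to look further, sketched below on a made-up contingency table, is to inspect the expected counts and standardized residuals that `chisq.test()` returns.

```r
# Toy contingency table: property type (rows) by borough (columns); counts are made up
toy_tab <- matrix(
  c(40, 70, 10,
    50, 40, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    property_type = c("Entire home/apt", "Private room"),
    borough       = c("Brooklyn", "Manhattan", "Queens")
  )
)

toy_result <- chisq.test(toy_tab)

# Counts expected in each cell if property type and borough were independent
toy_result$expected

# Standardized residuals: large positive or negative values mark cells
# that are over- or under-represented relative to independence
round(toy_result$stdres, 2)
```

The expected counts are also a quick check on the test's usual rule of thumb that cells should not be too sparse.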
#ANOVA: We will now use ANOVA to compare the means (prices) between the different boroughs.
# Create an ANOVA model
model <- aov(price_per_night ~ borough, data = airbnb_data)
# Print a summary of the results
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## borough 4 4.507e+07 11267628 299.4 <2e-16 ***
## Residuals 38838 1.462e+09 37631
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Based on the results of the ANOVA model, there is evidence that at least one borough's mean price differs significantly from the others
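ANOVA indicates that at least one group mean differs, but not which pairs of groups differ. A common follow-up is Tukey's HSD test, sketched below on simulated prices. The data frame `toy`, the group means, and the seed are assumptions for illustration only.

```r
# Simulated prices for three boroughs (made up; the seed keeps the example reproducible)
set.seed(1)
toy <- data.frame(
  borough = factor(rep(c("Bronx", "Brooklyn", "Manhattan"), each = 20)),
  price_per_night = c(rnorm(20, mean = 80,  sd = 20),
                      rnorm(20, mean = 120, sd = 20),
                      rnorm(20, mean = 190, sd = 20))
)

# Fit the one-way ANOVA, then compare every pair of boroughs
toy_model <- aov(price_per_night ~ borough, data = toy)
tukey <- TukeyHSD(toy_model)

# One row per pair: difference in means, confidence bounds, adjusted p-value
tukey$borough
```

Here `borough` is stored as a factor. In the tutorial data it is a character column, so converting it with `factor()` before fitting is a safe choice when using this follow-up.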
There were a few challenges when tidying the data:
Missing Data:
Outliers:
Variable Names:
A few examples:
neighbourhood was spelled with a “u”, which is correct in regions like the UK, but it was changed to the US spelling, neighborhood
neighbourhood_group was changed to borough because the data covers NYC, where these groups are called boroughs
room_type was changed to property_type because an Airbnb can also be an entire apartment or house; it is not limited to a room
price was changed to price_per_night to be more specific. Price could mean per hour, per month, or something else, so it is important to be clear
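The renames described above could also be written by name rather than by position. The sketch below uses dplyr's `rename()` on a made-up data frame `toy` that carries three of the original Kaggle column names. It is an alternative to assigning all column names at once, not the method used earlier in this tutorial.

```r
library(dplyr)

# Toy data frame with three of the original column names (values are made up)
toy <- data.frame(
  neighbourhood_group = c("Brooklyn", "Manhattan"),
  room_type           = c("Private room", "Entire home/apt"),
  price               = c(149, 225)
)

# rename(new_name = old_name) matches columns by name, so column order does not matter
toy <- toy %>%
  rename(
    borough         = neighbourhood_group,
    property_type   = room_type,
    price_per_night = price
  )

names(toy)
```

Renaming by name fails with an error if an expected column is absent, which can catch problems that a positional rename would silently mislabel.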
To conclude, we learned the basics of ggplot in R using the NYC Airbnb dataset from Kaggle. This library allowed us to dig deeper into properties in NYC and uncover insights about Airbnb listings, such as pricing, locations, and distribution across neighborhoods. We are now able to use ggplot to visualize “the city that never sleeps.”
If you would like to explore ggplot further, or other visualization tools in R, check out the recommended resources below:
ggplot2. (n.d.). ggplot2 reference. https://ggplot2.tidyverse.org/reference/
RPubs. (n.d.). https://rpubs.com/
Wickham, H. (2023). R for Data Science: Import, Tidy, transform, visualize, and model data. O’Reilly.